Alert - this site and blog posts are under development - Karl M.
The Goal of Statistical Learning - estimating \(f\)
machine learning
r
statistics
My take on machine learning based on statistics
Author
Karl Marquez
Published
August 7, 2025
What is statistical learning?
The goal of statistical learning is to develop an accurate model that can be used to predict output based on inputs. Inputs can be referred to as predictors, independent variable features, or just variables. The output is often called the response or dependent variable.
TV, radio, and newspaper seems to have a positive association to sales. The more the company spends on TV advertisement, the higher the associated sales. Newspaper seems to be the weakest advertising method.
For the three plots above, I inserted a line that results in the least amount of errors (the space between points and the line, given fixed x). Some errors are positive (points above the line), and some errors are negative (points below the line). To resolve this, we square the errors (thus, the least squares fit).
In a mathematical equation, it looks something like this: \[Y = f(X) + \epsilon\] Here, \(f\) is a fixed function of inputs, and \(\epsilon\) is a random error term, independent of the inputs. Overall, the errors have approximately a mean of zero.
To visualize, see below’s example:
Show me the code
income1 <-read.csv("Income1.csv")income1 <- income1 %>%mutate(res =residuals(loess(Income ~ Education)))incomeplot1 <-ggplot(data = income1, aes(x = Education, y = Income)) +geom_point(alpha =0.75, size =2.5, color ="steelblue") + karl_themeincomeplot2 <-ggplot(data = income1, aes(x = Education, y = Income)) +geom_point(alpha =0.75, size =2.5, color ="steelblue") +geom_smooth(method ="loess", se =FALSE, color ="blue", formula ="y ~ x") +geom_segment(aes(xend = Education, yend = Income - res), color ="red") + karl_themegrid.arrange(incomeplot1, incomeplot2, ncol =2)
The blue curve represents the relationship between income and years of education.
In the real world, the function \(f\) depends on more than one input. Adding another variable (seniority), the relationship tracking income using Education and Seniority is now a plane, a two-dimensional data that is estimated based on the two variables observed.